
Week 3 lecture notes

Reviewing introductory papers on interpretability and linguistic probes inside the black box of neural language models

Attention (Transformer architecture)

Attention can stand in for multiple transformations. For the concept of self-attention, we focus on Vaswani et al. (2017)-style computations, which rely on a Query-Key-Value framework. The general idea is that we want to compute contextualized token embeddings as weighted sums of value vectors, with the weights determined by query–key compatibility.

  • Vaswani et al. (2017): “An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.”
  • Three weight matrices:
    • $W_Q$, $W_K$, $W_V$ for queries, keys, and values
    • $q_i$, $k_i$ are representations of a token $t_i$, obtained by multiplying its embedding $e_i$ by $W_Q$ and $W_K$, respectively (and analogously $v_i$ via $W_V$)
    • Multiplying $Q$ by $K^T$, we obtain similarity scores: a square but asymmetric matrix (sequence length × sequence length)
    • Pass the similarity scores through a softmax and multiply by $V$, which brings us back to the hidden dimension $d$ (not the vocabulary size)
  • As input to the next layer, we need something that has the hidden-state size $d$ but reweights all the $e_i$s; that is what multiplying by $V$ accomplishes (see the sketch after this list)
  • Wikipedia: The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers.
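
A minimal NumPy sketch of the single-head computation described above, with toy sizes and random inputs (no multi-head splitting; variable names are illustrative, not from any particular implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 5, 8                                   # toy sequence length and hidden size
rng = np.random.default_rng(0)
E = rng.normal(size=(n, d))                   # token embeddings e_i, stacked row-wise

W_Q = rng.normal(size=(d, d))                 # query projection
W_K = rng.normal(size=(d, d))                 # key projection
W_V = rng.normal(size=(d, d))                 # value projection

Q, K, V = E @ W_Q, E @ W_K, E @ W_V
scores = Q @ K.T / np.sqrt(d)                 # (n, n) similarity matrix: square but asymmetric
A = softmax(scores, axis=-1)                  # attention weights; each row sums to 1
out = A @ V                                   # (n, d): reweighted values, back at hidden size d
```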

Quick demo of Transformers

  • extracting hidden states, self-attention weights (often just called “attentions”), and intermediate predictions
  • using the Hugging Face transformers library with BERT (see the sketch below)
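
A minimal sketch of the demo idea with the Hugging Face transformers library; the model name and example sentence are arbitrary choices, not necessarily the ones used in class. Intermediate (logit-lens-style) predictions would additionally require applying the masked-language-model head to each hidden state, which is omitted here.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,    # return the embedding layer plus every encoder layer
    output_attentions=True,       # return the self-attention weights of every layer
)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.hidden_states   # tuple of 13 tensors (embeddings + 12 layers), each (1, seq_len, 768)
attentions = outputs.attentions         # tuple of 12 tensors, each (1, num_heads, seq_len, seq_len)
print(len(hidden_states), hidden_states[-1].shape)
print(len(attentions), attentions[-1].shape)
```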

Week 2 Readings: Hidden states

Tenney, I., Das, D., & Pavlick, E. (2019, July). BERT Rediscovers the Classical NLP Pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 4593-4601). https://aclanthology.org/P19-1452/
  • Probing procedure described in Tenney et al. (2019, ICLR)
    • Formally, we represent a sentence as a list of tokens $T = [t_0, t_1, \ldots, t_n]$, and a labeled edge as $\{s^{(1)}, s^{(2)}, L\}$. We treat $s^{(1)} = [i^{(1)}, j^{(1)})$ and, optionally, $s^{(2)} = [i^{(2)}, j^{(2)})$ as (end-exclusive) spans. For unary edges such as constituent labels, $s^{(2)}$ is omitted. We take $L$ to be a set of zero or more targets from a task-specific label set $\mathcal{L}$.
    • The model is designed to have limited expressive power on its own, so as to focus on what information can be extracted from the contextual embeddings. We take a list of contextual vectors $[e_0, e_1, \ldots, e_n]$ and integer spans $s^{(1)} = [i^{(1)}, j^{(1)})$ and (optionally) $s^{(2)} = [i^{(2)}, j^{(2)})$ as inputs.
  • Scalar mixing weights learn a weighted sum of the layers, with a separate mix for each task (see the sketch at the end of this entry)
    • To pool across layers, we use the scalar mixing technique introduced by the ELMo model. Following Equation (1) of Peters et al. (2018a), for each task $\tau$ we introduce scalar parameters $\gamma_\tau$ and $a^{(0)}_\tau, a^{(1)}_\tau, \ldots, a^{(L)}_\tau$, and let $\mathbf{h}_{i,\tau} = \gamma_\tau \sum_{l=0}^{L} s^{(l)}_\tau \mathbf{h}^{(l)}_i$, where $s_\tau = \mathrm{softmax}(a_\tau)$.
  • Cumulative scoring allows a probing model to combine predictions from multiple layers
    • we train a series of classifiers $\{P^{(l)}_\tau\}_l$ which use scalar mixing (Eq. 1) to attend to layer $l$ as well as all previous layers. $P^{(0)}_\tau$ corresponds to a non-contextual baseline that uses only a bag of word(piece) embeddings, while $P^{(L)}_\tau = P_\tau$ corresponds to probing all layers of the BERT model. These classifiers are cumulative, in the sense that $P^{(l+1)}_\tau$ has a similar number of parameters but with access to strictly more information than $P^{(l)}_\tau$.
    • We can then compute a differential score $\Delta^{(l)}_\tau$, which measures how much better we do on the probing task if we observe one additional encoder layer $l$: $\Delta^{(l)}_\tau = \mathrm{Score}(P^{(l)}_\tau) - \mathrm{Score}(P^{(l-1)}_\tau)$
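
A minimal PyTorch sketch of the scalar mix in Eq. (1) (task subscript $\tau$ dropped) and of the differential score; the layer count and tensor shapes are illustrative, and this is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mixing: h_i = gamma * sum_l softmax(a)^(l) * h_i^(l)."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(num_layers + 1))  # a^(0..L); layer 0 = embeddings
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):
        # layers: list of (batch, seq_len, d) tensors h^(0), ..., h^(L)
        s = torch.softmax(self.a, dim=0)                    # s = softmax(a)
        stacked = torch.stack(layers, dim=0)                # (L+1, batch, seq_len, d)
        return self.gamma * (s.view(-1, 1, 1, 1) * stacked).sum(dim=0)

def differential_score(score_l: float, score_l_minus_1: float) -> float:
    """Delta^(l) = Score(P^(l)) - Score(P^(l-1)): gain from seeing one more encoder layer."""
    return score_l - score_l_minus_1

# Toy usage with fake BERT-base-sized layer outputs
mix = ScalarMix(num_layers=12)
layers = [torch.randn(2, 7, 768) for _ in range(13)]
pooled = mix(layers)                                        # (2, 7, 768)
```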
Durrani, N., Sajjad, H., Dalvi, F., & Belinkov, Y. (2020, November). Analyzing Individual Neurons in Pre-trained Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 4865-4880). https://aclanthology.org/2020.emnlp-main.395/
  • Search
    • Algorithm for selecting lambdas (regularization weights) in their Elasticnet implementation
    💡 Selecting the lambdas depends on a score that is a function of four accuracies:
    • We then compute a score for each lambda set $(\lambda_1, \lambda_2)$ as: $S(\lambda_1, \lambda_2) = \alpha(A_t - A_b) - \beta(A_z - A_l)$
    • First term
      • The first term ensures that we select a lambda set where accuracies of top and bottom neurons are further apart
      • $A_t$ is the accuracy of the classifier retaining the top neurons and masking the rest,
      • $A_b$ is the accuracy retaining the bottom neurons,
    • Second term
      • the second term ensures that we prefer weights that incur a minimal loss in classifier accuracy due to regularization.
      • $A_z$ is the accuracy of the classifier trained using all neurons but without regularization, and
      • $A_l$ is the accuracy with the current lambda set.
    • $\alpha$ and $\beta$ are both set to 0.5 in their experiments.
    • NB: ElasticNet combines L1 and L2 regularization, with a ratio between the two penalties determining how strongly the learned weights are regularized (L2 shrinks large coefficients; L1 forces many weights to exactly 0)
  • Neuron ranking algorithm
    • “We use the neuron ranking algorithm as described in Dalvi et al. (2019). Given the trained classifier $\theta \in \mathbb{R}^{D \times T}$, the algorithm extracts a ranking of the $D$ neurons in the model $M$. For each label $t$ in task $T$, the weights are sorted by their absolute values in descending order. To select the $N$ most salient neurons w.r.t. the task $T$, an iterative process is carried out … until the set reaches a specified size $N$.”
  • Minimal neuron selection
    1. “Train a classifier to predict the task using all the neurons (call it Oracle),
    2. Obtain a neuron ranking based on the ranking algorithm described above,
    3. Choose the top N neurons from the ranked list and retrain a classifier using these,
    4. Repeat step 3 by increasing the size of $N$, until the classifier obtains an accuracy close to the Oracle (within a specified threshold $\delta$).” (See the sketch below.)
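
A hypothetical sketch of the lambda-set scoring and the minimal-neuron-selection loop described above; `train_probe`, `accuracy`, `X`, and `y` are placeholder names standing in for the paper's classifier training and evaluation, not functions from its released code.

```python
def lambda_score(A_t, A_b, A_z, A_l, alpha=0.5, beta=0.5):
    """S(lambda_1, lambda_2) = alpha * (A_t - A_b) - beta * (A_z - A_l)."""
    return alpha * (A_t - A_b) - beta * (A_z - A_l)

def minimal_neurons(ranking, X, y, oracle_acc, delta, train_probe, accuracy, step=50):
    """Steps 1-4 above: grow the top-N neuron set until the probe's accuracy is
    within delta of the Oracle trained on all neurons.

    `ranking` is the list of neuron indices from the ranking algorithm;
    `X` is assumed to be an activations array of shape (samples, neurons).
    """
    n = step
    while n <= len(ranking):
        top = list(ranking[:n])                 # step 3: keep only the top-N neurons
        probe = train_probe(X[:, top], y)       # retrain a classifier on them
        if oracle_acc - accuracy(probe, X[:, top], y) <= delta:
            return top                          # close enough to the Oracle: stop
        n += step                               # step 4: increase N and repeat
    return list(ranking)                        # fallback: all neurons were needed
```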

Week 2 Readings: Self-attentions

Sarthak Jain and Byron C. Wallace. 2019. Attention is not Explanation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3543–3556, Minneapolis, Minnesota. Association for Computational Linguistics. https://aclanthology.org/N19-1357/
  • “Attention weights should correlate with feature importance measures (e.g., gradient-based measures)”
  • Alternatively, shuffling the attention weights (counterfactual weightings) should disrupt the model’s predictions
    • However, note that an alternative weighting of the attentions only changes the scalar multipliers applied to the values passed to the next layer; the embeddings themselves are still passed upward (see the sketch below)
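
A toy NumPy illustration of that caveat: permuting the attention weights changes only the mixing coefficients, while the value vectors being mixed are untouched (the weights and values here are random stand-ins, not taken from a trained model).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
A = rng.dirichlet(np.ones(n), size=n)   # stand-in attention weights; each row sums to 1
V = rng.normal(size=(n, d))             # stand-in value vectors

out = A @ V                             # original weighted sum of values
A_perm = A[:, rng.permutation(n)]       # counterfactual: permute each row's weights
out_perm = A_perm @ V                   # the same value vectors are still passed upward
```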
Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics. https://aclanthology.org/D19-1002/
  • Time permitting!